Bike Ridership EDA

Author

Madelyne Ventura

Published

April 10, 2023

Read in Datasets

  • GeoJSON for the bikelane data
  • Capital Bikeshare dataset for March
Code
import pandas as pd
import numpy as np
import altair as alt
import plotly.graph_objects as go
from vega_datasets import data
import requests
import json

Bikeshare Data Cleaning & Prep

Read in the bikeshare dataset. This dataset covers March 2023 and comes from Capital Bikeshare's published trip data.

Code
bikeshare_df = pd.read_csv('../data/202303-capitalbikeshare-tripdata.csv')
bikeshare_df.head()
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 20CAF4CAD9186B1C docked_bike 2023-03-26 16:52:35 2023-03-26 17:22:08 Washington & Independence Ave SW/HHS 31272.0 15th St & Constitution Ave NW 31321.0 38.886978 -77.013769 38.892244 -77.033234 casual
1 695D9110D59A0A42 classic_bike 2023-03-28 16:51:30 2023-03-28 17:06:32 11th & C St SE 31659.0 D St & Maryland Ave NE 31612.0 38.885908 -76.991476 38.894841 -76.995916 member
2 3E284FFCC357FBEF classic_bike 2023-03-28 19:25:55 2023-03-28 19:36:41 18th & M St NW 31221.0 California St & Florida Ave NW 31116.0 38.905067 -77.041779 38.917761 -77.040620 member
3 47CC92C4A31AB8CA classic_bike 2023-03-03 16:33:58 2023-03-03 16:38:52 17th & K St NW 31213.0 17th & K St NW / Farragut Square 31233.0 38.902760 -77.038630 38.902061 -77.038322 member
4 A4B1F3F9E8DF03A3 docked_bike 2023-03-11 14:31:24 2023-03-11 14:54:06 10th & G St NW 31274.0 2nd St & Massachusetts Ave NE 31641.0 38.898243 -77.026235 38.894972 -77.003135 casual

Look at the datatypes and summary statistics.

Code
bikeshare_df.info()  # .info() prints directly and returns None, so don't wrap it in print()
print(f'Summary statistics of dataset: \n{bikeshare_df.describe()}')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290430 entries, 0 to 290429
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             290430 non-null  object 
 1   rideable_type       290430 non-null  object 
 2   started_at          290430 non-null  object 
 3   ended_at            290430 non-null  object 
 4   start_station_name  274401 non-null  object 
 5   start_station_id    274401 non-null  float64
 6   end_station_name    272993 non-null  object 
 7   end_station_id      272993 non-null  float64
 8   start_lat           290430 non-null  float64
 9   start_lng           290430 non-null  float64
 10  end_lat             289937 non-null  float64
 11  end_lng             289937 non-null  float64
 12  member_casual       290430 non-null  object 
dtypes: float64(6), object(7)
memory usage: 28.8+ MB
Summary statistics of dataset: 
       start_station_id  end_station_id      start_lat      start_lng  \
count     274401.000000   272993.000000  290430.000000  290430.000000   
mean       31365.563431    31365.698564      38.902330     -77.031188   
std          270.729789      274.325101       0.024873       0.031336   
min        30200.000000    30200.000000      38.770000     -77.368416   
25%        31213.000000    31215.000000      38.890496     -77.043593   
50%        31276.000000    31275.000000      38.902314     -77.031500   
75%        31606.000000    31608.000000      38.912659     -77.013756   
max        33200.000000    33200.000000      39.125828     -76.825535   

             end_lat        end_lng  
count  289937.000000  289937.000000  
mean       38.901324     -77.030849  
std         0.024433       0.031006  
min        38.770000     -77.370000  
25%        38.890461     -77.043386  
50%        38.901136     -77.030230  
75%        38.911268     -77.013500  
max        39.125828     -76.825535  

Since the dataset timestamps are object datatypes, I will convert them into Pandas datetime.

Code
# Convert dates into datetime format
bikeshare_df['started_at'] = pd.to_datetime(bikeshare_df['started_at'])
bikeshare_df['ended_at'] = pd.to_datetime(bikeshare_df['ended_at'])
bikeshare_df.info()  # prints the dtype summary directly
bikeshare_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290430 entries, 0 to 290429
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   ride_id             290430 non-null  object        
 1   rideable_type       290430 non-null  object        
 2   started_at          290430 non-null  datetime64[ns]
 3   ended_at            290430 non-null  datetime64[ns]
 4   start_station_name  274401 non-null  object        
 5   start_station_id    274401 non-null  float64       
 6   end_station_name    272993 non-null  object        
 7   end_station_id      272993 non-null  float64       
 8   start_lat           290430 non-null  float64       
 9   start_lng           290430 non-null  float64       
 10  end_lat             289937 non-null  float64       
 11  end_lng             289937 non-null  float64       
 12  member_casual       290430 non-null  object        
dtypes: datetime64[ns](2), float64(6), object(5)
memory usage: 28.8+ MB
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
0 20CAF4CAD9186B1C docked_bike 2023-03-26 16:52:35 2023-03-26 17:22:08 Washington & Independence Ave SW/HHS 31272.0 15th St & Constitution Ave NW 31321.0 38.886978 -77.013769 38.892244 -77.033234 casual
1 695D9110D59A0A42 classic_bike 2023-03-28 16:51:30 2023-03-28 17:06:32 11th & C St SE 31659.0 D St & Maryland Ave NE 31612.0 38.885908 -76.991476 38.894841 -76.995916 member
2 3E284FFCC357FBEF classic_bike 2023-03-28 19:25:55 2023-03-28 19:36:41 18th & M St NW 31221.0 California St & Florida Ave NW 31116.0 38.905067 -77.041779 38.917761 -77.040620 member
3 47CC92C4A31AB8CA classic_bike 2023-03-03 16:33:58 2023-03-03 16:38:52 17th & K St NW 31213.0 17th & K St NW / Farragut Square 31233.0 38.902760 -77.038630 38.902061 -77.038322 member
4 A4B1F3F9E8DF03A3 docked_bike 2023-03-11 14:31:24 2023-03-11 14:54:06 10th & G St NW 31274.0 2nd St & Massachusetts Ave NE 31641.0 38.898243 -77.026235 38.894972 -77.003135 casual
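With proper datetimes in place, trip duration and time-of-day features become one-liners via the `.dt` accessor. A quick sketch on a single illustrative row (`ride_length` and `start_hour` are names I'm introducing here, not columns used later in this notebook):

```python
import pandas as pd

# One illustrative trip, mirroring the first row of the dataset
df = pd.DataFrame({
    'started_at': pd.to_datetime(['2023-03-26 16:52:35']),
    'ended_at':   pd.to_datetime(['2023-03-26 17:22:08']),
})
df['ride_length'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60  # minutes
df['start_hour'] = df['started_at'].dt.hour
print(df[['ride_length', 'start_hour']])
```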
Code
# Check number of rows in dataset
print(f'Dimensions of dataset: {bikeshare_df.shape}')
Dimensions of dataset: (290430, 13)

Next, drop NaNs from the dataset. Since some trips are missing a starting or ending station, remove those rows.

Code
bikeshare_df.dropna(subset=['start_station_name'], inplace = True)
bikeshare_df.dropna(subset=['end_station_name'], inplace = True)
print(f'Dimensions of dataset: {bikeshare_df.shape}')
Dimensions of dataset: (266570, 13)
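For scale, the two `dropna` calls removed roughly 8% of trips, which is worth keeping in mind when interpreting later counts. The arithmetic follows straight from the two shapes printed above:

```python
# Row counts before and after dropping missing-station trips (from the printed shapes)
before, after = 290430, 266570
dropped = before - after
print(f'Dropped {dropped} rows ({dropped / before:.1%} of trips)')
```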

Capital Bikeshare records each bike's exact latitude and longitude wherever it sits at a station, so coordinates for the same station differ slightly from trip to trip. I am going to standardize the latitude and longitude for both the starting and ending locations.

Code
# Standardize longitude and latitude using start station
bikeshare_df['start_lng'] = bikeshare_df['start_lng'].groupby(bikeshare_df['start_station_id']).transform('max')
bikeshare_df['start_lat'] = bikeshare_df['start_lat'].groupby(bikeshare_df['start_station_id']).transform('max')
Code
# Create dataframe for joining
tmp = bikeshare_df[['start_station_id', 'start_lng', 'start_lat']].drop_duplicates()
print(tmp.start_station_id.nunique())
720

Now that the start longitude and latitude are standardized, I will make sure the end latitude and longitude match those same standardized values for each station.

Code
# Merge using the common station id value
bikeshare_df = bikeshare_df.merge(tmp, left_on = 'end_station_id', right_on = 'start_station_id')
Code
# Drop repeated columns and rename them
bikeshare_df.drop(columns = ['end_lat', 'end_lng', 'start_station_id_y'], inplace = True)
bikeshare_df.rename(columns = {'start_lat_x': 'start_lat', 'start_lng_x': 'start_lng', 'start_lat_y': 'end_lat', 'start_lng_y':'end_lng', 'start_station_id_x': 'start_station_id'}, inplace = True)
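One thing worth noting: `merge` defaults to an inner join, so any trip whose end station never appears as a start station is silently dropped here. A hedged sketch of how to audit that before committing to the join, using pandas' `indicator` parameter (toy stand-in data, not the real dataset):

```python
import pandas as pd

# Toy stand-ins: trips and a station-coordinate lookup (illustrative data only)
trips = pd.DataFrame({'ride_id': ['a', 'b'], 'end_station_id': [31272.0, 99999.0]})
lookup = pd.DataFrame({'start_station_id': [31272.0],
                       'start_lng': [-77.013769], 'start_lat': [38.886978]})

# A left join with indicator=True flags rows that would vanish under an inner join
audit = trips.merge(lookup, left_on='end_station_id',
                    right_on='start_station_id', how='left', indicator=True)
unmatched = audit[audit['_merge'] == 'left_only']
print(f'Trips with no matching station: {len(unmatched)}')
```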

Let’s check that the ending and starting stations have unique values for the latitude and longitude values. These should all be values of 1, since we only want a 1-to-1 match.

Code
print(f"Checking max nunique value in start lng: {max(bikeshare_df.groupby('start_station_id')['start_lng'].nunique())}")
print(f"Checking max nunique value in start lat: {max(bikeshare_df.groupby('start_station_id')['start_lat'].nunique())}")
print(f"Checking max nunique value in end lat: {max(bikeshare_df.groupby('end_station_id')['end_lat'].nunique())}")
print(f"Checking max nunique value in end lng: {max(bikeshare_df.groupby('end_station_id')['end_lng'].nunique())}")
Checking max nunique value in start lng: 1
Checking max nunique value in start lat: 1
Checking max nunique value in end lat: 1
Checking max nunique value in end lng: 1
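The same eyeball check can be made fail-fast with assertions, so a future data refresh that breaks the 1-to-1 property stops the notebook immediately. A sketch on toy data; in the notebook you would run it against `bikeshare_df` directly:

```python
import pandas as pd

# Toy frame where each station id maps to exactly one coordinate pair
toy = pd.DataFrame({
    'start_station_id': [31272.0, 31272.0, 31659.0],
    'start_lat': [38.886978, 38.886978, 38.885908],
    'start_lng': [-77.013769, -77.013769, -76.991476],
})
for col in ['start_lat', 'start_lng']:
    assert toy.groupby('start_station_id')[col].nunique().max() == 1, f'{col} not standardized'
print('Coordinate standardization holds')
```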

Now that the data is prepped, I am going to create a list of non-D.C. stations. The Capital Bikeshare dataset includes all bikeshare stops in the D.C. metro area; since our project focuses only on D.C., I am going to exclude stations outside the District.

Code
# Create list of bikeshare stations outside of DC
nondc_stations = [
    32256,32251,32237,32241,32210,32225,32259,32223,32209,32240,32239,32245,32220,32214,32219,
    32224,32217,32213,32239,32246,32247,32250,32248,32246,32228,32215,32238,32252,32249,32260,
    32234,32231,32235,32255,32200,32208,32201,32211,32227,32207,32229,32221,32206,32233,32205,
    32204,32205,32203,32206,32222,32230,32232,32600,32602,32603,32608,32605,32604,32607,32609,
    31948,31904,32606,32601,31921,31905,31902,31901,31976,31036,31977,31900,31920,31049,31037,
    31926,31919,31035,31973,31069,31023,31022,31021,31019,31020,31094,31092,31079,31030,31029,
    31080,31093,31014,31062,31077,31073,31024,31040,31028,31017,31924,31027,31947,31066,31075,31949,31053,31971,31067,31058,31923,31063,31068,31951,31945,31095,31006,31005,31091,31004,
    31936,31071,31090,31950,31064,31935,31011,31012,31009,31944,31052,31010,31959,31916,31088,
    31960,31956,31910,31083,31915,31087,31085,31913,31915,31970,31969,31906,31098,31048,31081,
    31084,31082,31974,31930,31932,31953,31942,31967,32406,32423,32415,32407,32405,32401,32400,
    32405,32404,32413,32418,32410,32403,32408,32421,32402,32417,32422,32420,32414,32412,32416,
    32059,32061,32026,32011,32049,32082,32058,32025,32001,32058,32082,32024,32043,32036,32012,
    32034,32035,32050,32056,32426,32425,32424,32426,32085,32094,32089,32093,32091,32090,32087,
    32088,32086,32092,32022,32066,32064,32062,32065,32073,32063,32084,32054,32051,32040,32046,
    32029,32055,32002,32021,32003,32048,32013,32000,32008,32028,32027,32053,32039,32057,32078,
    32075,32077,32076,32079,32080,32074,32081,32032,32047,32044,32017,32007,32009,32023,32033,
    32016,32004,32005,32072,32041,32052,32071,32038,32037,32045,32067,32069,32068,32018,32253,
    32236,32243,32258,32216,32212,32218,32019,32411,31929,31914,31907,31903,31958,31933,31041,
    31042,31968,31044,31045,31955,31046,31047,31099,31043,31097,31931,31918,31086,31927,31966,
    21943,31963,31952,31964,31962,31908,31072,31941,31961,31928,31054,31033,31059,31057,31061,
    31056,31055,31909,31912,31065,31032,31074,31078,32419,31957,31954,31946,31972,31060,31938,
    31013,31002,31007,31000,31003,31096,31070,31039,31034,31025,31038,31026,31050,31940,31089,
    31031,31051,31937,31016,31018,31039,31015,31917,31076,31939,32409
]
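The list above contains a handful of repeated ids (e.g. 32239 and 32246 each appear twice). `isin` handles duplicates fine, but converting to a set makes the membership intent clearer and lookups cheaper. A minor sketch with an abbreviated stand-in for the full list:

```python
# Abbreviated stand-in for the full nondc_stations list above
nondc_stations = [32239, 32246, 32239, 32246, 31000]
nondc_set = set(nondc_stations)  # duplicates collapse away
print(len(nondc_stations), len(nondc_set))
# Later filtering is unchanged: bikeshare_df['start_station_id'].isin(nondc_set)
```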

EDA using Altair

First, let’s look at the top bike stations in D.C.

Code
# Remove limit for Altair
alt.data_transformers.enable('default', max_rows = None)
DataTransformerRegistry.enable('default')

The table below shows the most common places people begin their trips using bikeshare.

Code
# Create groupby of top 15 most popular bike stations
grouped_df = bikeshare_df.groupby('start_station_name').agg({'ride_id': 'count'}).reset_index()
grouped_df.rename(columns={'ride_id': 'count_rides'}, inplace = True)
# Keep only top 15
grouped_df = grouped_df.sort_values('count_rides', ascending = False).head(15)
grouped_df
start_station_name count_rides
282 Columbus Circle / Union Station 3441
508 New Hampshire Ave & T St NW 2819
434 Lincoln Memorial 2767
61 15th & P St NW 2649
615 Smithsonian-National Mall / Jefferson Dr & 12t... 2574
395 Jefferson Dr & 14th St SW 2371
178 5th & K St NW 2350
396 Jefferson Memorial 2335
49 14th & V St NW 2305
105 1st & M St NE 2249
78 17th St & Independence Ave SW 2144
174 4th St & Madison Dr NW 2084
324 Eastern Market Metro / Pennsylvania Ave & 8th ... 2064
389 Henry Bacon Dr & Lincoln Memorial Circle NW 1984
459 Massachusetts Ave & Dupont Circle NW 1902

We can show this visually using a bar graph in Altair.

Code
# Create selection
selection = alt.selection_single(fields=['start_station_name'],name='Random')
color = alt.condition(selection,
                      alt.Color('start_station_name:N', scale= alt.Scale(scheme="accent"), title = "Station Name"),
                      alt.value('lightgray'))
# Make bar graph
bar=(alt.Chart(bikeshare_df[bikeshare_df['start_station_name'].isin(grouped_df['start_station_name'].to_list())])
 .mark_bar()
 .encode(y='count(ride_id):Q',
         x=alt.X('start_station_name:N',
         sort=alt.EncodingSortField(field='ride_id', op='count', 
                            order='descending')),
         color=color,
         tooltip=['start_station_name:N', 'count(ride_id):Q']
        )
).add_selection(selection)

bar.title ="Top 15 Capital Bikeshare Stations"
bar.encoding.x.title = 'Station'
bar.encoding.y.title = 'Count of Rides in March 2023'
bar

Bikeshare Route Visualization

The first visual just plots the station locations without a background.

Code
# Filter non-DC stations
tmp = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp = tmp[~tmp['end_station_id'].isin(nondc_stations)]
tmp = tmp[['start_station_name', 'start_lng', 'start_lat']].drop_duplicates()

# Map of points for ride share locations
points = alt.Chart(tmp).mark_circle().encode(
    longitude='start_lng:Q',
    latitude='start_lat:Q',
    size=alt.value(10),
    tooltip=['start_station_name:N']
)

points

The code in the cell below creates a background of Washington D.C. This will be used for the different visuals.

Code
# Download the GeoJSON file of D.C. census tracts
response1 = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')

# Create background of D.C.
background = alt.Chart(alt.Data(values=response1.json())).mark_geoshape(
        fill="lightgray",
        stroke='white',
        strokeWidth=1
    ).encode(
    ).properties(
        width=600,
        height=600
    )

background

Now we can combine the point and background Altair layers into one plot that shows the bikeshare stations on a map of D.C.

Code
background + points

Using this methodology, I am going to create a dynamic visualization that is made up of the following three components:

  • Washington D.C. Background
  • Capital Bikeshare Stations
  • Capital Bikeshare Routes Taken in D.C. in March 2023
Code
#### BACKGROUND FOR MAP 

# Define background of Washington D.C.
response1 = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')

background = alt.Chart(alt.Data(values=response1.json())).mark_geoshape(
        fill="lightgray",
        stroke='white',
        strokeWidth=1
    ).encode(
    ).properties(
        width=600,
        height=600
    )

#### MOUSEOVER SELECTION

# Create mouseover selection
select_station = alt.selection_single(
    on="mouseover", nearest=True, fields=["start_station_name"], empty='none'
)

#### NETWORK CONNECTIONS FOR MAP 

# Filter non-DC stations
tmp1 = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp1 = tmp1[~tmp1['end_station_id'].isin(nondc_stations)]

# Keep only relevant columns and drop duplicates to have one row per route
tmp1 = tmp1[['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng']].drop_duplicates()

# Define connections
connections = alt.Chart(tmp1).mark_rule(opacity=0.35).encode(
    latitude="start_lat:Q",
    longitude="start_lng:Q",
    latitude2="end_lat:Q",
    longitude2="end_lng:Q"
).transform_filter(
    select_station
)

#### POINTS FOR MAP 

# Filter non-DC stations
tmp2 = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp2 = tmp2[~tmp2['end_station_id'].isin(nondc_stations)]

# Temporary dataframe showing unique station locations with ride count
tmp2 = tmp2[['start_station_name','start_station_id', 'start_lng', 'start_lat', 'ride_id']].groupby(['start_station_name', 'start_station_id','start_lng', 'start_lat']).agg({'ride_id': 'count'}).reset_index()
tmp2.rename(columns= {'ride_id':'count_rides'}, inplace = True)

points = alt.Chart(tmp2).mark_circle().encode(
    latitude="start_lat:Q",
    longitude="start_lng:Q",
    size=alt.Size("count_rides:Q", scale=alt.Scale(range=[15, 250]), legend=None),
    order=alt.Order("count_rides:Q", sort="descending"),
    tooltip=["start_station_name:N", "start_station_id:Q", "count_rides:Q"]
).add_selection(
    select_station
)

# Show visualization
(background + connections + points).configure_view(stroke=None)